This project was given 3 dataset from data.zip file where I, as the (potential) data scientist at MoneyLion plans to assess the loan repayment quality of the given customer in the dataset.
Full information of the dataset is referred from the dictionary. But in short,
loan.csv
This file returns 19 columns consist of the customer’s loan information along with their ssn and id. Every row represents and accepted loan application.
payment.csv
This file returns 9 columns consist of the customer’s payment information along with their ACH error codes and loan id.
clarity_underwriting_variables.csv
This file returns 54 rows and 10 columns where each row variable determine the underwriting report.
```{python}# summary table for payment.csvdef describe_full(df): numeric_stats = df.describe() categorical_stats = {col: df[col].value_counts() for col in df.select_dtypes(include=['object', 'bool']).columns} print("Numeric Statistics:") print(numeric_stats) print("\nCategorical Counts:") for col, counts in categorical_stats.items(): print(f"\n{col}:") print(counts)describe_full(loan)```
Numeric Statistics:
apr nPaidOff isFunded loanAmount \
count 573760.000000 577658.000000 577682.000000 575432.000000
mean 553.080972 0.037887 0.067480 514.245084
std 110.046159 0.333366 0.250852 320.939929
min 0.000000 0.000000 0.000000 0.000000
25% 490.000000 0.000000 0.000000 350.000000
50% 590.000000 0.000000 0.000000 500.000000
75% 601.000000 0.000000 0.000000 500.000000
max 705.590000 21.000000 1.000000 5000.000000
originallyScheduledPaymentAmount leadCost hasCF
count 577682.000000 577682.000000 577682.000000
mean 1428.897209 7.854389 0.619187
std 925.009141 12.853451 0.485587
min -816.710000 0.000000 0.000000
25% 1023.640000 3.000000 0.000000
50% 1245.250000 3.000000 1.000000
75% 1615.660000 6.000000 1.000000
max 19963.630000 200.000000 1.000000
Categorical Counts:
loanId:
loanId
LL-I-07399092 1
LL-I-09120818 1
LL-I-12495837 1
LL-I-09315207 1
LL-I-07105524 1
..
LL-I-18326270 1
LL-I-14232965 1
LL-I-06403601 1
LL-I-10713224 1
LL-I-04733921 1
Name: count, Length: 577426, dtype: int64
anon_ssn:
anon_ssn
c8bb49de1f8ff99d2ecddfb7037dc66e 35
0b87684b60c8b8f5d0bd40eb5811cd50 20
f971898a3c70e4c3c316cb4bc48b6e2e 18
6f0e71bc16f634fafcf5337cecdfb25e 16
64cd38e9cd1e06ed480b5e607494ce57 15
..
9e28174874e0ddfaab268a6fda20f419 1
ba2f64a44344fcf8b5f0c4076bb6777f 1
ebe6ec6901aa8f9f59520ac4e65e0142 1
d3c1943189c453ce5149dc5c7c921f06 1
d7e55e85266208ac4c353f42ebcde5ca 1
Name: count, Length: 459393, dtype: int64
payFrequency:
payFrequency
B 316654
W 137188
M 59092
I 33787
S 29688
Name: count, dtype: int64
applicationDate:
applicationDate
2017-01-03T18:05:40.811000 3
2016-08-19T14:07:17.373000 2
2016-08-11T03:12:23.616000 2
2016-09-12T16:56:05.319000 2
2016-11-29T23:27:55.525000 2
..
2016-08-30T23:15:03.936000 1
2015-10-16T07:06:51.886000 1
2016-08-03T00:56:19.972000 1
2016-06-22T17:48:34.161000 1
2015-11-17T22:04:20.862000 1
Name: count, Length: 577624, dtype: int64
originated:
originated
False 531676
True 46006
Name: count, dtype: int64
originatedDate:
originatedDate
2017-03-06T22:29:45.066000 2
2015-02-20T19:40:33.329000 2
2016-01-20T15:49:18.846000 1
2015-12-13T18:52:44.655000 1
2017-03-18T19:15:07.210000 1
..
2017-03-03T18:00:09.560000 1
2017-03-14T13:45:08.746000 1
2016-07-03T20:12:54.603000 1
2016-01-11T17:12:16.285000 1
2015-11-12T22:35:47.644000 1
Name: count, Length: 46042, dtype: int64
approved:
approved
False 537646
True 40036
Name: count, dtype: int64
loanStatus:
loanStatus
Withdrawn Application 450984
Rejected 85070
Paid Off Loan 11427
External Collection 11334
New Loan 8112
Internal Collection 5567
Returned Item 1182
CSR Voided New Loan 1026
Settlement Paid Off 708
Credit Return Void 704
Customer Voided New Loan 504
Settled Bankruptcy 326
Pending Paid Off 169
Charged Off Paid Off 160
Pending Application Fee 5
Pending Rescind 4
Pending Application 4
Voided New Loan 2
Charged Off 1
Customver Voided New Loan 1
Settlement Pending Paid Off 1
Name: count, dtype: int64
state:
state
OH 90496
IL 66710
TX 49729
MO 49249
WI 40349
MI 34635
TN 32991
NC 26962
FL 25896
IN 25781
SC 23925
CA 22691
NV 11477
PA 9723
VA 9197
NJ 7641
UT 6742
AL 4327
MS 3786
CO 3615
LA 3431
AZ 3355
NM 3231
KY 2911
SD 2212
MN 2078
GA 1763
OK 1758
WY 1715
CT 1468
WA 1384
KS 1227
IA 1043
DE 880
ID 722
RI 689
NE 635
HI 621
AK 290
ND 207
NY 3
MD 3
OH-TEST 1
TX-TEST 1
Name: count, dtype: int64
leadType:
leadType
bvMandatory 475001
lead 72673
organic 22851
prescreen 4515
rc_returning 2069
california 479
lionpay 26
repeat 24
instant-offer 22
express 22
Name: count, dtype: int64
fpStatus:
fpStatus
Checked 32978
No Payments 11427
Rejected 5522
No Schedule 1323
Cancelled 249
Skipped 218
Pending 5
Returned 1
Name: count, dtype: int64
clarityFraudId:
clarityFraudId
561e95f7e4b0efa8a6cdc975 15
5675c05de4b08b757e32e3b4 14
583d9d11e4b0bb9d66559063 13
581101f3e4b04db4de10a616 12
5666318be4b0100e879ac2f0 10
..
579a6118e4b0d050242220df 1
577e4a29e4b06e5979673c41 1
57dad980e4b01a79d8b38492 1
5748eaf5e4b061d2b9c3f275 1
564f801ae4b05a02bef951e8 1
Name: count, Length: 314915, dtype: int64
--------------- `approved` Frequency Table ---------------
Freq
% Valid
% Valid Cum.
% Total
% Total Cum.
False
537646
93.07
93.07
93.07
93.07
True
40036
6.93
100.00
6.93
100.00
Total
577682
100.00
100.00
100.00
100.00
show code
```{python}freq(loan, 'originated')```
--------------- `originated` Frequency Table ---------------
Freq
% Valid
% Valid Cum.
% Total
% Total Cum.
False
531676
92.04
92.04
92.04
92.04
True
46006
7.96
100.00
7.96
100.00
Total
577682
100.00
100.00
100.00
100.00
show code
```{python}freq(loan, 'loanStatus')```
--------------- `loanStatus` Frequency Table ---------------
Freq
% Valid
% Valid Cum.
% Total
% Total Cum.
CSR Voided New Loan
1026.0
0.18
99.55
0.18
99.48
Charged Off
1.0
0.00
100.00
0.00
100.00
Charged Off Paid Off
160.0
0.03
100.00
0.03
100.00
Credit Return Void
704.0
0.12
99.80
0.12
99.73
Customer Voided New Loan
504.0
0.09
99.88
0.09
99.82
Customver Voided New Loan
1.0
0.00
100.00
0.00
100.00
External Collection
11334.0
1.96
96.80
1.96
96.73
Internal Collection
5567.0
0.96
99.17
0.96
99.10
New Loan
8112.0
1.41
98.20
1.40
98.14
Paid Off Loan
11427.0
1.98
94.84
1.98
94.77
Pending Application
4.0
0.00
100.00
0.00
100.00
Pending Application Fee
5.0
0.00
100.00
0.00
100.00
Pending Paid Off
169.0
0.03
99.97
0.03
99.97
Pending Rescind
4.0
0.00
100.00
0.00
100.00
Rejected
85070.0
14.74
92.86
14.73
92.79
Returned Item
1182.0
0.20
99.37
0.20
99.31
Settled Bankruptcy
326.0
0.06
99.94
0.06
99.94
Settlement Paid Off
708.0
0.12
99.67
0.12
99.61
Settlement Pending Paid Off
1.0
0.00
100.00
0.00
100.00
Voided New Loan
2.0
0.00
100.00
0.00
100.00
Withdrawn Application
450984.0
78.12
78.12
78.07
78.07
NaN
NaN
NaN
NaN
0.07
99.88
Total
577682.0
100.00
100.00
100.00
100.00
show code
```{python}freq(loan, 'leadType')```
--------------- `leadType` Frequency Table ---------------
Freq
% Valid
% Valid Cum.
% Total
% Total Cum.
bvMandatory
475001
82.23
82.23
82.23
82.23
lead
72673
12.58
94.81
12.58
94.81
organic
22851
3.96
98.76
3.96
98.76
prescreen
4515
0.78
99.54
0.78
99.54
rc_returning
2069
0.36
99.90
0.36
99.90
california
479
0.08
99.98
0.08
99.98
lionpay
26
0.00
99.99
0.00
99.99
repeat
24
0.00
99.99
0.00
99.99
instant-offer
22
0.00
100.00
0.00
100.00
express
22
0.00
100.00
0.00
100.00
Total
577682
100.00
100.00
100.00
100.00
show code
```{python}freq(loan, 'leadType')```
--------------- `leadType` Frequency Table ---------------
Freq
% Valid
% Valid Cum.
% Total
% Total Cum.
bvMandatory
475001
82.23
82.23
82.23
82.23
lead
72673
12.58
94.81
12.58
94.81
organic
22851
3.96
98.76
3.96
98.76
prescreen
4515
0.78
99.54
0.78
99.54
rc_returning
2069
0.36
99.90
0.36
99.90
california
479
0.08
99.98
0.08
99.98
lionpay
26
0.00
99.99
0.00
99.99
repeat
24
0.00
99.99
0.00
99.99
instant-offer
22
0.00
100.00
0.00
100.00
express
22
0.00
100.00
0.00
100.00
Total
577682
100.00
100.00
100.00
100.00
show code
```{python}nPaidOff_mean = round(loan['nPaidOff'].dropna().mean() * 100, 2)print(f"Average pay off rate: {nPaidOff_mean}")```
Average pay off rate: 3.79
loan.csv shows a high-risk lending operation due to very low approval where 40,036 loan was approved out of 577,682 it has 6.93% approval rate. Although relatively high APR (median of 590%), only 7.96% Originated from applications. High APRs, low approval rates, low originated rates, and low pay off rate suggesting this is a high-risk lending operation indicate majority are subprime borrower base with significant credit risk.
Assuming this is in USD (since all the states mentioned are from United States). With loan amounts generally being higher than scheduled payments. The dashed line indicate the maximum for loan amount and payment amount. Maximum loan is $5,000 while maximum original payment shows the 4x multiplier from interest and fees resulting in $19,963.63 payment amount.
show code
```{python}freq(loan, 'payFrequency')```
--------------- `payFrequency` Frequency Table ---------------
Freq
% Valid
% Valid Cum.
% Total
% Total Cum.
B
316654.0
54.94
54.94
54.81
54.81
I
33787.0
5.86
94.85
5.85
94.64
M
59092.0
10.25
88.99
10.23
88.79
S
29688.0
5.15
100.00
5.14
99.78
W
137188.0
23.80
78.74
23.75
78.56
NaN
NaN
NaN
NaN
0.22
100.00
Total
577682.0
100.00
100.00
100.00
100.00
Majority of 54.94% or 316,654 loans were paid biweekly (B) where least common were paid semi-monthly (S). However, there are mising data in this variable. lets investigate!
show code
```{python}# This might be used later, hence i create this function def view_missing_frequency(df, column): missing_freq = df[df[column].isna()].head() print(missing_freq.to_string())view_missing_frequency(loan, 'payFrequency')```
loanId anon_ssn payFrequency apr applicationDate originated originatedDate nPaidOff approved isFunded loanStatus loanAmount originallyScheduledPaymentAmount state leadType leadCost fpStatus clarityFraudId hasCF
214 LL-I-08119905 8d94d36f11e4b203189e1694b52b6f61 NaN 590.0 2016-03-25T17:32:23.432000 False NaN 0.0 False 0 Withdrawn Application 300.0 0.0 OH organic 0 No Payments 56f57622e4b030e93c770970 1
821 LL-I-08854991 dd5291dc42c18b5d97b6747cf5102fd4 NaN 590.0 2016-04-19T23:20:40.375000 False NaN 0.0 False 0 Withdrawn Application 300.0 0.0 IN organic 0 No Payments 5716bd44e4b0cd57626777c6 1
1461 LL-I-11401749 9d2851e8d30add57fb6cc6722e0f25d7 NaN 590.0 2016-08-24T00:46:42.965000 False NaN 0.0 False 0 Rejected 300.0 0.0 TN organic 0 No Payments 57bcee6ee4b033e6e6541c1a 1
1961 LL-I-11826560 40a58015111e3d9599975322aff494c1 NaN 590.0 2016-09-07T11:49:25.009000 False NaN 0.0 False 0 Rejected 300.0 0.0 CA organic 0 No Payments 57ce6808e4b092499fe7209e 1
2136 LL-I-16240119 fad43a089c335acb76aa6b9aaf111a0a NaN 645.0 2017-01-25T22:59:58.767000 False NaN 0.0 False 0 Withdrawn Application 500.0 0.0 KY organic 0 No Payments 58548f8ee4b08f245ec11450 1
Apparently, the first 5 approved variable is false, lets call all the unique values in payfrequency.
show code
```{python}unique_values = loan[loan['payFrequency'].isna()]['approved'].unique()print(f"All approved values after filtering pay frequency: {unique_values}")```
All approved values after filtering pay frequency: [False]
So it seems all NA from pay frequency is a result of unapproved loan.
--------------- `isCollection` Frequency Table ---------------
Freq
% Valid
% Valid Cum.
% Total
% Total Cum.
False
675469
97.98
97.98
97.98
97.98
True
13895
2.02
100.00
2.02
100.00
Total
689364
100.00
100.00
100.00
100.00
show code
```{python}freq(payment,'paymentReturnCode')```
--------------- `paymentReturnCode` Frequency Table ---------------
Freq
% Valid
% Valid Cum.
% Total
% Total Cum.
C01
87.0
0.28
98.95
0.01
99.95
C02
10.0
0.03
99.85
0.00
99.99
C03
34.0
0.11
99.82
0.00
99.99
C05
106.0
0.34
98.67
0.02
99.94
C07
2.0
0.01
99.99
0.00
100.00
LPP01
7.0
0.02
99.89
0.00
100.00
MISSED
537.0
1.70
95.54
0.08
99.80
R01
22866.0
72.51
72.51
3.32
98.74
R02
2761.0
8.76
81.27
0.40
99.14
R03
318.0
1.01
96.55
0.05
99.84
R04
39.0
0.12
99.71
0.01
99.99
R06
6.0
0.02
99.91
0.00
100.00
R07
160.0
0.51
98.34
0.02
99.92
R08
2259.0
7.16
88.43
0.33
99.47
R09
176.0
0.56
97.83
0.03
99.90
R10
620.0
1.97
93.84
0.09
99.72
R13
2.0
0.01
99.98
0.00
100.00
R15
3.0
0.01
99.98
0.00
100.00
R16
1085.0
3.44
91.88
0.16
99.63
R19
1.0
0.00
100.00
0.00
100.00
R20
83.0
0.26
99.21
0.01
99.96
R29
4.0
0.01
99.96
0.00
100.00
R99
60.0
0.19
99.40
0.01
99.97
RAF
58.0
0.18
99.58
0.01
99.98
RBW
5.0
0.02
99.95
0.00
100.00
RFG
3.0
0.01
99.97
0.00
100.00
RIR
1.0
0.00
99.99
0.00
100.00
RUP
6.0
0.02
99.93
0.00
100.00
RWC
7.0
0.02
99.87
0.00
99.99
RXL
1.0
0.00
100.00
0.00
100.00
RXS
226.0
0.72
97.27
0.03
99.88
NaN
NaN
NaN
NaN
95.43
95.43
Total
689364.0
100.00
100.00
100.00
100.00
show code
```{python}freq(payment,'paymentReturnCode')```
--------------- `paymentReturnCode` Frequency Table ---------------
Freq
% Valid
% Valid Cum.
% Total
% Total Cum.
C01
87.0
0.28
98.95
0.01
99.95
C02
10.0
0.03
99.85
0.00
99.99
C03
34.0
0.11
99.82
0.00
99.99
C05
106.0
0.34
98.67
0.02
99.94
C07
2.0
0.01
99.99
0.00
100.00
LPP01
7.0
0.02
99.89
0.00
100.00
MISSED
537.0
1.70
95.54
0.08
99.80
R01
22866.0
72.51
72.51
3.32
98.74
R02
2761.0
8.76
81.27
0.40
99.14
R03
318.0
1.01
96.55
0.05
99.84
R04
39.0
0.12
99.71
0.01
99.99
R06
6.0
0.02
99.91
0.00
100.00
R07
160.0
0.51
98.34
0.02
99.92
R08
2259.0
7.16
88.43
0.33
99.47
R09
176.0
0.56
97.83
0.03
99.90
R10
620.0
1.97
93.84
0.09
99.72
R13
2.0
0.01
99.98
0.00
100.00
R15
3.0
0.01
99.98
0.00
100.00
R16
1085.0
3.44
91.88
0.16
99.63
R19
1.0
0.00
100.00
0.00
100.00
R20
83.0
0.26
99.21
0.01
99.96
R29
4.0
0.01
99.96
0.00
100.00
R99
60.0
0.19
99.40
0.01
99.97
RAF
58.0
0.18
99.58
0.01
99.98
RBW
5.0
0.02
99.95
0.00
100.00
RFG
3.0
0.01
99.97
0.00
100.00
RIR
1.0
0.00
99.99
0.00
100.00
RUP
6.0
0.02
99.93
0.00
100.00
RWC
7.0
0.02
99.87
0.00
99.99
RXL
1.0
0.00
100.00
0.00
100.00
RXS
226.0
0.72
97.27
0.03
99.88
NaN
NaN
NaN
NaN
95.43
95.43
Total
689364.0
100.00
100.00
100.00
100.00
show code
```{python}freq(payment,'paymentStatus')```
--------------- `paymentStatus` Frequency Table ---------------
Freq
% Valid
% Valid Cum.
% Total
% Total Cum.
Cancelled
270334.0
51.46
51.46
39.21
39.21
Checked
209621.0
39.90
91.37
30.41
69.62
Complete
1.0
0.00
100.00
0.00
100.00
Pending
9241.0
1.76
99.28
1.34
99.45
Rejected
32330.0
6.15
97.52
4.69
98.11
Rejected Awaiting Retry
18.0
0.00
100.00
0.00
100.00
Returned
1.0
0.00
100.00
0.00
100.00
Skipped
3761.0
0.72
100.00
0.55
100.00
NaN
NaN
NaN
NaN
23.80
93.42
Total
689364.0
100.00
100.00
100.00
100.00
It seems unusual how, principal, fees and payment amount has a negative value. Although, this could be because of data entry errors, refunds or reversals and system errors in calculation.